Introduction to SurvMarker

SurvMarker is an R package that implements a PCA-based weighted feature scoring method for high-dimensional molecular data, such as gene or miRNA expression matrices. The package provides a statistically principled workflow that integrates principal component analysis (PCA) with time-to-event outcomes to identify features whose variation is systematically associated with patient survival. Rather than relying on arbitrary thresholds or single-PC effects, SurvMarker aggregates survival-relevant signals across multiple principal components (PCs) and calibrates feature importance using empirical null distributions.
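The general idea can be sketched in a few lines of R using only `prcomp()` and the survival package. This is an illustrative sketch on simulated data, not the package's internal code: the weighting scheme (absolute Cox z-statistics of the PCs that pass the FDR cutoff) and the resulting score `S` are assumptions made for the example.

```r
library(survival)

set.seed(1)
n <- 120; p <- 40; K <- 10
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))

# Simulated survival data: the hazard depends on the first five features
lp <- 0.5 * rowSums(X[, 1:5])
event_time <- rexp(n, rate = exp(lp))
cens_time <- rexp(n, rate = 0.3)
time <- pmin(event_time, cens_time)
status <- as.integer(event_time <= cens_time)

# Step 1: PCA on the scaled feature matrix
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, seq_len(K)]
loadings <- pca$rotation[, seq_len(K)]

# Step 2: one univariate Cox model per PC, then BH adjustment
cox_stats <- t(apply(scores, 2, function(pc) {
  fit <- coxph(Surv(time, status) ~ pc)
  summary(fit)$coefficients[1, c("z", "Pr(>|z|)")]
}))
pc_fdr <- p.adjust(cox_stats[, "Pr(>|z|)"], method = "BH")

# Step 3 (assumed weighting): PCs failing the cutoff get weight 0,
# the others are weighted by their absolute Cox z-statistic
w <- abs(cox_stats[, "z"]) * (pc_fdr <= 0.05)
S <- setNames(as.vector(abs(loadings) %*% w), colnames(X))

head(sort(S, decreasing = TRUE))
```

If no PC passes the cutoff in a given run, every weight is zero and all scores are zero, which is the expected behaviour of this sketch rather than an error.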

Dependencies

survival, ggplot2, VennDiagram

PCA-based feature scoring

Main function call

pca_based_weighted_score(
  X,
  time,
  status,
  covar = NULL,
  n_pcs = 50,
  cumvar_threshold = NULL,
  max_pcs = 50,
  pc_fdr_cutoff = 0.05,
  feature_fdr_cutoff = 0.05,
  null_B = 500,
  seed = 1,
  scale_pca = TRUE,
  use_abs_loadings = TRUE,
  store_null = TRUE,
  verbose = TRUE
)
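A minimal usage sketch on simulated inputs is shown below. The data are random and purely illustrative, the argument values are arbitrary choices, and the call is guarded so that the snippet also runs where SurvMarker is not installed. It assumes the function is exported from the package namespace.

```r
set.seed(1)
n <- 100; p <- 200

# Samples in rows, features in columns
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(paste0("sample", seq_len(n)),
                            paste0("feature", seq_len(p))))
time <- rexp(n, rate = 0.1)          # survival times, same order as rows of X
status <- rbinom(n, 1, prob = 0.7)   # 1 = event, 0 = censored

# Basic input checks before calling the package
stopifnot(nrow(X) == length(time), length(time) == length(status))

if (requireNamespace("SurvMarker", quietly = TRUE)) {
  res <- SurvMarker::pca_based_weighted_score(
    X = X, time = time, status = status,
    n_pcs = 20, null_B = 200, seed = 1, verbose = FALSE
  )
  print(head(res$feature_table))
  print(res$selected_features)
}
```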

Function arguments

| Argument | Description |
| --- | --- |
| X | Expression matrix of dimension n_samples × p_features. Rows correspond to samples and columns to molecular features (e.g., genes, miRNAs, proteins). |
| time | Survival times, ordered consistently with the rows of X. |
| status | Event indicator (1 = event, 0 = censored). |
| covar | Clinical covariates to include in Cox regression models. |
| n_pcs | Number of principal components to retain. Ignored if cumvar_threshold is specified. |
| cumvar_threshold | Minimum cumulative variance threshold used to determine the number of PCs retained. |
| max_pcs | Hard upper bound on the number of PCs used (safety cap). |
| pc_fdr_cutoff | False discovery rate cutoff for selecting survival-associated principal components. |
| feature_fdr_cutoff | False discovery rate cutoff for selecting prognostic molecular features. |
| null_B | Number of empirical null resamples used for feature-level inference. |
| seed | Random seed for reproducibility. |
| scale_pca | Logical; whether to scale features prior to PCA. |
| use_abs_loadings | Logical; whether absolute loadings are used in feature score aggregation. |
| store_null | Logical; whether to store the full empirical null score matrix. |
| verbose | Logical; whether to print progress messages during execution. |
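The interaction between n_pcs, cumvar_threshold, and max_pcs described above can be sketched as follows. `choose_n_pcs()` is a hypothetical helper written for this illustration, not a SurvMarker function, and the package's actual rule may differ in detail.

```r
# Hypothetical helper: pick the number of PCs from PCA standard deviations.
# cumvar_threshold, when given, overrides n_pcs; max_pcs always caps the result.
choose_n_pcs <- function(sdev, n_pcs = 50, cumvar_threshold = NULL, max_pcs = 50) {
  cumvar <- cumsum(sdev^2) / sum(sdev^2)
  if (is.null(cumvar_threshold)) {
    k <- n_pcs
  } else {
    k <- which(cumvar >= cumvar_threshold)[1]
    if (is.na(k)) k <- length(sdev)  # threshold not reached: keep all PCs
  }
  min(k, max_pcs, length(sdev))
}

# Toy spectrum: variances 5, 3, 1, 1 (cumulative variance 0.5, 0.8, 0.9, 1.0)
sdev <- sqrt(c(5, 3, 1, 1))
choose_n_pcs(sdev, n_pcs = 3)                              # fixed number of PCs
choose_n_pcs(sdev, cumvar_threshold = 0.85)                # threshold-based
choose_n_pcs(sdev, cumvar_threshold = 0.85, max_pcs = 2)   # capped by max_pcs
```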

Returned values

| Component | Description |
| --- | --- |
| feature_table | Data frame containing feature loadings on survival-associated PCs, aggregated feature scores (Sj), empirical p-values, and false discovery rates (FDR). |
| pc_table | PCA summary table including eigenvalues, proportion and cumulative variance explained, Cox regression coefficients, and adjusted p-values for each PC. |
| pc_scores | Sample-level principal component scores used for downstream visualization and clustering. |
| null_scores | Empirical null score matrix (features × null_B), returned when store_null = TRUE. |
| selected_features | Character vector of prognostically significant features passing feature-level FDR control. |
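To illustrate how a features × null_B null matrix can be turned into empirical p-values and FDR values, the sketch below uses a toy null matrix and made-up observed scores. The add-one p-value formula and the Benjamini-Hochberg adjustment are common conventions assumed here; the package's exact calculation is not documented in this section and may differ.

```r
set.seed(1)
p <- 6; B <- 500

# Toy null score matrix: one row per feature, one column per null resample
null_scores <- matrix(abs(rnorm(p * B)), nrow = p, ncol = B)

# Made-up observed scores: features 1, 3 and 6 lie far outside the null range
observed <- c(10, 0.1, 10, 0.5, 0.2, 10)

# Empirical p-value: share of null scores at least as large as the observed
# score, with an add-one correction so that no p-value is exactly zero.
# `observed` is recycled down the rows, so row j is compared to observed[j].
emp_p <- (1 + rowSums(null_scores >= observed)) / (B + 1)

# Benjamini-Hochberg adjustment and selection at a 5% FDR cutoff
fdr <- p.adjust(emp_p, method = "BH")
selected <- which(fdr <= 0.05)
selected
```

With the add-one correction the smallest attainable p-value is 1 / (null_B + 1), so null_B limits how small the feature-level FDR values can become.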

Visualizations

Example 1: Gene expression data from TCGA-LAML cohort

PCA diagnostics

Survival-associated structure

plot_pc12() and plot_top2_survival_pcs() visualize survival-relevant latent structure using PC scores annotated by clinical or molecular groups.

Feature-level inference

plot_null_vs_observed() contrasts observed feature scores against empirical null distributions, distinguishing significant from non-significant features.

Sensitivity to PC choice and feature stability assessment

plot_venn() visualizes the overlap of selected features across different PC choices. plot_feature_set_tradeoff() summarizes the relationship between cumulative variance explained and feature set size.
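The overlap that such a Venn diagram displays can also be summarized numerically. The sketch below uses hypothetical feature sets (the gene names and PC settings are invented for illustration) and base R set operations; it does not call any SurvMarker function.

```r
# Hypothetical selected-feature sets obtained under three PC choices
feature_sets <- list(
  pcs_10 = c("geneA", "geneB", "geneC"),
  pcs_20 = c("geneB", "geneC", "geneD"),
  pcs_30 = c("geneB", "geneC", "geneD", "geneE")
)

# Features selected under every PC choice
stable <- Reduce(intersect, feature_sets)

# Pairwise Jaccard index: |A and B| / |A or B|
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
pairs <- combn(names(feature_sets), 2)
overlap <- apply(pairs, 2, function(nm) {
  jaccard(feature_sets[[nm[1]]], feature_sets[[nm[2]]])
})
names(overlap) <- apply(pairs, 2, paste, collapse = " vs ")

stable
overlap
```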

Example 2: miRNA expression data from TCGA-LAML cohort

PCA diagnostics

Survival-associated structure

Feature-level inference

Sensitivity to PC choice and feature stability assessment

Notes

  • SurvMarker is designed for survival-associated biomarker discovery and builds on classical survival analysis, particularly the Cox proportional hazards model.
  • Users should have a basic knowledge of survival analysis concepts, particularly the structure of the input data.
  • Each subject's outcome is represented by a survival time and an event indicator (1 = event, 0 = censored). The event is the outcome of interest (e.g., death, relapse, or progression); censored means that the event was not observed for that subject during the study period.
  • Given a molecular feature matrix and corresponding survival time and status vectors, SurvMarker can be applied directly.
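As a small illustration of the time and status encoding described in these notes, the sketch below builds a right-censored survival object with the survival package. The five observations are made up for the example.

```r
library(survival)

# Made-up follow-up data for five subjects
time <- c(5, 8, 12, 3, 9)      # follow-up time for each subject
status <- c(1, 0, 1, 1, 0)     # 1 = event observed, 0 = censored

# Surv() pairs each time with its event indicator;
# censored observations print with a trailing "+"
y <- Surv(time, status)
print(y)

# Number of observed events and censored subjects
table(status)
```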